This document explores some options for trained ensembles we could start using for COVID-19. We focus on results for incident cases and deaths only, because a complete set of results for hospitalizations and cumulative deaths is not available. The last forecast date evaluated is 2021-01-18: results for the "full history" method described below are incomplete for the week of 2021-01-25, because I stopped running estimation after 3 days.

Overall Scores

These scores summarize model skill for each combination of base target and spatial scale.

For brevity, we'll look here at performance for a subset of the variations on "trained" approaches that we have considered. Below are the settings we're examining, and reasons we chose them from among the alternatives.

  • We use the constraint that the model weights are non-negative and sum to 1, and we do not include an intercept. A more flexible variation only enforces that the weights are non-negative and includes an intercept. Overall, the performance of this more flexible method can be slightly better for cases than the convex versions, but it seems less stable, with a lot of variation in performance across window sizes, and it is consistently much worse for deaths. I have stuck with the more constrained method, which has more stable performance.
  • Missing forecasts are mean-imputed and then weights are redistributed according to missingness levels; this approach has limitations and needs refinement, but has been better than performing estimation separately for each group of locations with complete data in every evaluation I've looked at.
  • We do not employ any checks of model forecasts other than the validations performed on submission. I have not looked at approaches using these checks recently, but in analyses from a few months ago they were not very helpful for trained ensembles.
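The convexity constraint in the first bullet, together with one simple way of handling a missing model, can be sketched as follows. This is a simplified illustration, not the estimation code behind these results: it drops a missing model and renormalizes the remaining weights, which is only one possible reading of "weights are redistributed", and it skips the mean-imputation step. The function name and the example model names are hypothetical.

```python
def convex_combination(predictions, weights):
    """Combine one quantile's predictions from several models.

    predictions: dict of model name -> predicted value, or None if missing.
    weights: dict of model name -> weight; must be non-negative and sum to 1.
    """
    if any(w < 0 for w in weights.values()):
        raise ValueError("weights must be non-negative")
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")

    # Keep only the models that actually submitted a forecast.
    available = {m: p for m, p in predictions.items() if p is not None}
    if not available:
        raise ValueError("no forecasts available")

    # ASSUMPTION: redistribute weight by renormalizing over available models.
    total = sum(weights[m] for m in available)
    if total == 0:
        # All available models have zero weight: fall back to equal weights.
        return sum(available.values()) / len(available)
    return sum(weights[m] / total * p for m, p in available.items())


weights = {"model_a": 0.5, "model_b": 0.3, "model_c": 0.2}
full = convex_combination({"model_a": 100, "model_b": 200, "model_c": 300}, weights)
partial = convex_combination({"model_a": 100, "model_b": 200, "model_c": None}, weights)
```

Because the weights stay on the simplex after renormalization, the combined value always lies between the smallest and largest available prediction.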

Within these settings, we explore variations in the training set window size (the number of past weeks of forecasts used to estimate ensemble weights). For state and national level forecasts, we consider a range of window sizes from 3 weeks to 10 weeks, and a "full history" method that goes back to the first week with forecasts from at least two component models. For county level forecasts, we restrict to just a window size of 3 weeks because a larger window size is computationally infeasible for generating forecasts in real time.
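As a rough illustration of what a training window means here, the sketch below lists the past weekly forecast dates that would feed weight estimation for a given forecast date. It assumes weekly forecast dates and ignores the fact that longer horizons may not yet have observed values; `training_weeks` and its arguments are hypothetical names, and `window_size=None` stands in for the "full history" setting.

```python
from datetime import date, timedelta


def training_weeks(forecast_date, window_size, first_available):
    """Return the past weekly forecast dates used for training, oldest first.

    window_size: number of past weeks, or None to go back to first_available.
    first_available: earliest forecast date that may be used.
    """
    weeks = []
    d = forecast_date - timedelta(weeks=1)
    while d >= first_available and (window_size is None or len(weeks) < window_size):
        weeks.append(d)
        d -= timedelta(weeks=1)
    return sorted(weeks)


# A 3-week window for the last forecast date evaluated in this document.
window_3 = training_weeks(date(2021, 1, 18), 3, date(2020, 4, 1))
```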

We also consider three quantile grouping strategies: "per model" weights, where each model has a single weight shared across all quantile levels; "per quantile" weights, where there is a separate weight parameter for each combination of model and quantile level; and "3 groups" of quantile levels, where weights are shared within the three lowest levels, the three highest levels, and the middle ones.
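To make the three grouping strategies concrete, here is a minimal sketch that maps each quantile level to the index of the weight group it belongs to. The function name, the strategy labels, and the example quantile levels are illustrative assumptions rather than identifiers from the analysis code.

```python
def quantile_groups(levels, strategy):
    """Map each quantile level to the index of its weight group."""
    levels = sorted(levels)
    n = len(levels)
    if strategy == "per_model":
        # One weight per model, shared by every quantile level.
        return {q: 0 for q in levels}
    if strategy == "per_quantile":
        # A separate weight for every quantile level.
        return {q: i for i, q in enumerate(levels)}
    if strategy == "3_groups":
        # Three lowest levels, three highest levels, and everything between.
        if n < 7:
            raise ValueError("need at least 7 levels for a non-empty middle group")
        return {q: (0 if i < 3 else 2 if i >= n - 3 else 1)
                for i, q in enumerate(levels)}
    raise ValueError(f"unknown strategy: {strategy}")


# Example levels chosen for illustration only.
example_levels = [0.025, 0.1, 0.25, 0.5, 0.75, 0.9, 0.975]
groups = quantile_groups(example_levels, "3_groups")
```

With M component models, the number of weight parameters is M times the number of distinct groups, so "3 groups" sits between the other two strategies in flexibility.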

Finally, we consider a "prospective selection" method that, each week, chooses the ensemble method with the best average WIS in previous weeks. However, in these results prospective selection could not choose the "full history" ensemble, because those results were not available at the time the prospective selection method was run.
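The selection rule can be sketched as below. This is a minimal illustration that assumes per-week mean WIS values are already available for each candidate method; the data layout, function name, and method labels are hypothetical, and the real procedure may differ in which weeks and forecasts it averages over.

```python
def prospective_selection(wis_by_method, target_week):
    """Pick the method with the lowest mean WIS over weeks before target_week.

    wis_by_method: dict of method name -> dict of ISO week date -> mean WIS.
    Methods with no scores before target_week cannot be selected.
    """
    best_method, best_mean = None, float("inf")
    for method, by_week in wis_by_method.items():
        # ISO date strings sort chronologically, so "<" means "earlier than".
        past = [score for week, score in by_week.items() if week < target_week]
        if not past:
            continue
        mean_wis = sum(past) / len(past)
        if mean_wis < best_mean:
            best_method, best_mean = method, mean_wis
    return best_method


# Made-up scores for illustration only.
scores = {
    "window_3": {"2021-01-04": 40.0, "2021-01-11": 30.0, "2021-01-18": 90.0},
    "window_10": {"2021-01-04": 38.0, "2021-01-11": 36.0, "2021-01-18": 20.0},
    "late_method": {"2021-01-18": 10.0},
}
chosen = prospective_selection(scores, "2021-01-18")
```

Only scores from weeks strictly before the target week are used, so a method whose results arrive late (like `late_method` above) is simply not a candidate.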

We compare to two "untrained" ensembles: an equally-weighted mean (ew) at each quantile level and a median at each quantile level.
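For reference, the two untrained ensembles amount to taking a mean or a median across models separately at each quantile level. The sketch below assumes each model's forecast is a mapping from quantile level to predicted value; the function name and the numbers are illustrative.

```python
from statistics import mean, median


def untrained_ensembles(forecasts):
    """Return (equal-weighted mean, median) ensembles per quantile level.

    forecasts: dict of model name -> dict of quantile level -> value.
    Only quantile levels provided by every model are combined.
    """
    levels = sorted(set.intersection(*(set(f) for f in forecasts.values())))
    ew = {q: mean(f[q] for f in forecasts.values()) for q in levels}
    med = {q: median(f[q] for f in forecasts.values()) for q in levels}
    return ew, med


# Made-up forecasts for illustration only.
example = {
    "model_a": {0.25: 80, 0.5: 100, 0.75: 130},
    "model_b": {0.25: 90, 0.5: 120, 0.75: 150},
    "model_c": {0.25: 100, 0.5: 200, 0.75: 320},
}
ew, med = untrained_ensembles(example)
```

In this example the median ensemble is less affected by the one model with much higher predictions than the equally-weighted mean is.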

We perform estimation either separately for each spatial scale (National, State, and County), or jointly across the State and National levels.

The overall average scores in the tables below are computed across a comparable set of forecasts for all models, determined by the model evaluated with the fewest available forecasts (corresponding to a training set window of 10). For incident deaths, the relative rankings of median and mean ("ew") can change as a few more weeks are added or removed from the evaluation set. Per-week scores plotted further down are computed across a comparable set of forecasts for all models that are available within each week.
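The idea of a comparable set can be sketched as an intersection: keep only the forecasts that every model was scored on, then average. The key structure (location, forecast date, horizon) and the function name are assumptions made for illustration.

```python
def comparable_scores(scores):
    """Restrict every model's scores to the forecasts scored for all models.

    scores: dict of model name -> dict of forecast key -> score, where a
    forecast key is, for example, (location, forecast_date, horizon).
    """
    common = set.intersection(*(set(s) for s in scores.values()))
    return {m: {k: s[k] for k in sorted(common)} for m, s in scores.items()}


# Made-up scores for illustration only.
raw = {
    "ew": {("US", "2021-01-11", 1): 10.0, ("US", "2021-01-18", 1): 30.0},
    "window_10": {("US", "2021-01-18", 1): 20.0},
}
subset = comparable_scores(raw)
mean_scores = {m: sum(s.values()) / len(s) for m, s in subset.items()}
```

Without this step, a model with fewer forecasts would be averaged over a different, possibly easier or harder, set of weeks than the others.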

Incident Cases

National

National level mean scores across comparable forecasts for all methods:

State

State level mean scores across comparable forecasts for all methods:

County

County level mean scores across comparable forecasts for all methods:

Incident Deaths

National

National level mean scores across comparable forecasts for all methods:

State

State level mean scores across comparable forecasts for all methods:

Plots showing scores by week

In these plots we show results for the mean, median, the prospective selection method, and a variation on the method using the full history that had reasonably good performance. For state level results, this is the variation with 3 quantile groups, and estimation_grouping == "state"; for national level results, it's the variation with 3 quantile groups and estimation_grouping == "state_national".

WIS by week
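Assuming WIS here is the weighted interval score as commonly defined for quantile forecasts, it can be computed for a single forecast as the average of twice the pinball (quantile) loss across the quantile levels, which is equivalent to the interval-based definition when the levels form a median plus symmetric central intervals. The sketch below is illustrative and is not the scoring code used to produce these plots.

```python
def weighted_interval_score(quantiles, observed):
    """WIS of one quantile forecast.

    quantiles: dict of quantile level -> predicted value; assumed to be the
    median plus symmetric pairs of levels (e.g. 0.25 and 0.75).
    observed: the value that was eventually observed.
    """
    total = 0.0
    for level, predicted in quantiles.items():
        if observed >= predicted:
            pinball = level * (observed - predicted)
        else:
            pinball = (1 - level) * (predicted - observed)
        total += 2 * pinball
    return total / len(quantiles)


# Made-up forecast: median 100 with a 50% interval of [90, 110]; truth 120.
wis = weighted_interval_score({0.25: 90, 0.5: 100, 0.75: 110}, 120)
```

A per-week value would then be the mean of this quantity over all locations and horizons scored in that week.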

MAE by week

Two-sided interval coverage by week:
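Coverage of a central interval is the fraction of observations that fall between the matching lower and upper quantiles, for example the 0.25 and 0.75 quantiles for the 50% interval. A minimal sketch, with a hypothetical function name and made-up numbers:

```python
def interval_coverage(forecasts, observations, level):
    """Fraction of observations inside the central interval of a given level.

    forecasts: list of dicts of quantile level -> predicted value.
    observations: list of observed values, aligned with forecasts.
    level: nominal coverage, e.g. 0.5, 0.8 or 0.95.
    """
    alpha = 1 - level
    # Round to avoid floating-point noise when looking up quantile levels.
    lower_q = round(alpha / 2, 3)
    upper_q = round(1 - alpha / 2, 3)
    hits = sum(
        1 for f, y in zip(forecasts, observations)
        if f[lower_q] <= y <= f[upper_q]
    )
    return hits / len(observations)


example_forecasts = [
    {0.025: 60, 0.25: 90, 0.75: 110, 0.975: 150},
    {0.025: 60, 0.25: 90, 0.75: 110, 0.975: 150},
]
coverage_50 = interval_coverage(example_forecasts, [100, 120], 0.5)
coverage_95 = interval_coverage(example_forecasts, [100, 120], 0.95)
```

A well calibrated method should have empirical coverage close to the nominal level in each panel.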

50%

80%

95%

Forecast Score Availability

This section displays heat maps showing score availability by date, target_variable, spatial scale, and model. In each cell, we expect to see a number of scores equal to the number of locations for the given spatial scale times the number of horizons for the given target.
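The expected count per cell is a simple product, and cells that fall short of it are the ones worth inspecting in the heat maps. A small sketch with hypothetical names and illustrative counts (the location and horizon numbers below are examples, not values taken from this analysis):

```python
def incomplete_cells(score_counts, n_locations, n_horizons):
    """Return the cells whose score count differs from the expected count.

    score_counts: dict of (forecast_date, model) -> number of scores.
    """
    expected = n_locations * n_horizons
    return {cell: n for cell, n in score_counts.items() if n != expected}


counts = {
    ("2021-01-11", "ew"): 204,
    ("2021-01-18", "ew"): 200,
}
# Illustrative: 51 locations and 4 horizons give 204 expected scores per cell.
short = incomplete_cells(counts, n_locations=51, n_horizons=4)
```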

All forecasts

County

State

National

Forecasts available for all models that are available within each combination of base target and spatial scale

Here we have subset the forecasts to those that are comparable across all models within each combination of base target and spatial scale. We expect to see the exact same score counts for all models within each plot facet. Average scores computed within a combination of base target and spatial scale will be comparable.

County

State

National

Forecasts available for all models that are available within each combination of base target, spatial scale, and week

Here we have subset the forecasts to those that are comparable across all models within each combination of base target, spatial scale, and week. We expect to see the exact same score counts within each column of the plot, for all models for which any forecasts are available. Average scores computed within a combination of base target, spatial scale, and forecast week will be comparable.

County

State

National